Self-Supervised Learning for Real-World Object Detection: a Survey

Alina Ciocarlan; Arnaud Woiselle; Sidonie Lefebvre; Sylvie Le Hégarat-Mascle

arXiv: 2410.07442 · v2 · submitted 2024-10-09 · 💻 cs.CV

Pith reviewed 2026-05-23 19:00 UTC · model grok-4.3

classification 💻 cs.CV

keywords: self-supervised learning, object detection, instance discrimination, masked image modeling, CNN, ViT, small object detection, remote sensing

The pith

Instance discrimination SSL methods pair best with CNN encoders while masked image modeling suits ViT architectures for object detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey reviews self-supervised learning methods tailored for real-world object detection, with special attention to small objects in complex scenes. It contrasts instance discrimination approaches against masked image modeling and evaluates them across CNN and vision transformer backbones. Experiments on the COCO dataset and an infrared remote-sensing vehicle dataset indicate that instance discrimination benefits CNN encoders, whereas masked image modeling works better with ViTs and when pre-training occurs on custom uncurated data. The results supply a decision guide for choosing the SSL strategy and encoder combination to lift detection accuracy, especially under limited data or compute constraints.

Core claim

Instance discrimination methods perform well with CNN-based encoders, while MIM methods are better suited for ViT-based architectures and custom dataset pre-training. Choosing an appropriate SSL pre-training strategy along with a suitable encoder significantly enhances performance in real-world object detection, particularly for small object detection in frugal settings.

What carries the argument

Head-to-head comparison of instance discrimination versus masked image modeling SSL pre-training, each paired with either CNN or ViT encoders, measured on COCO and domain-specific infrared imagery for small-object detection accuracy.

If this is right

CNN-based detectors should default to instance discrimination pre-training to improve small-object recall.
ViT-based detectors and custom-domain pre-training should use masked image modeling instead.
The architecture-strategy matching yields measurable gains on real-world tasks such as infrared vehicle detection.
Practitioners can consult the survey table to pick the combination that matches their backbone and data constraints.
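
The pairing rule stated in the list above can be written down as a small lookup. This is only a sketch of the summary given here, not code from the paper: the `Setup` type, the function name, and the string labels are hypothetical, and the tie-break for a CNN pre-trained on custom data is an assumption, since the summary does not spell that case out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setup:
    backbone: str             # "cnn" or "vit" (hypothetical labels)
    custom_pretraining: bool  # pre-training on uncurated, domain-specific data


def suggest_ssl_family(setup: Setup) -> str:
    """Return the SSL family that the summarized findings point to."""
    backbone = setup.backbone.lower()
    if backbone not in {"cnn", "vit"}:
        raise ValueError(f"unknown backbone family: {setup.backbone!r}")
    # Summarized finding: MIM suits ViT encoders and custom-dataset pre-training.
    # Assumption: custom pre-training takes priority over the backbone type.
    if backbone == "vit" or setup.custom_pretraining:
        return "masked_image_modeling"
    # Summarized finding: instance discrimination performs well with CNN encoders.
    return "instance_discrimination"
```

For example, `suggest_ssl_family(Setup("cnn", False))` returns `"instance_discrimination"`. The rule is only as reliable as the benchmarks behind it, which is the concern raised under "Load-bearing premise".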

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reported pairings may generalize to other dense-prediction tasks that also struggle with small objects.
Repeating the benchmarks on additional small-object domains such as medical or aerial imagery would test the stability of the CNN-versus-ViT rule.
Future SSL designs could combine elements of both instance discrimination and masked image modeling to reduce architecture dependence.

Load-bearing premise

The reported performance differences between SSL strategies and encoders arise purely from those choices rather than from uncontrolled differences in training hyperparameters, data curation, or object-size distributions across runs.

What would settle it

A re-run of the COCO and infrared benchmarks that fixes every training detail except the SSL method and encoder type, then shows no consistent accuracy gap between the claimed best pairings.
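
One way to picture such a re-run is a run grid in which only the SSL method, the encoder, and the dataset vary, while every other setting is shared. The sketch below is illustrative: all names and values in `FIXED` are placeholders chosen for the example, not the paper's actual hyperparameters.

```python
from itertools import product

# Placeholder settings shared by every run (assumed values, not the paper's).
FIXED = {
    "epochs": 100,
    "batch_size": 256,
    "optimizer": "adamw",
    "lr_schedule": "cosine",
    "seed": 0,
}
SSL_METHODS = ["instance_discrimination", "masked_image_modeling"]
ENCODERS = ["cnn", "vit"]
DATASETS = ["coco", "infrared_vehicles"]


def build_runs():
    """Enumerate one run config per (SSL method, encoder, dataset) combination."""
    return [
        {**FIXED, "ssl": ssl, "encoder": enc, "dataset": ds}
        for ssl, enc, ds in product(SSL_METHODS, ENCODERS, DATASETS)
    ]


def only_factors_vary(runs):
    """True if every run carries identical values for all keys in FIXED."""
    return all(run[key] == value for run in runs for key, value in FIXED.items())
```

Here `build_runs()` yields 2 x 2 x 2 = 8 configurations, and `only_factors_vary` is a cheap guard against a hyperparameter drifting between runs.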

Figures

Figures reproduced from arXiv: 2410.07442 by Alina Ciocarlan, Arnaud Woiselle, Sidonie Lefebvre, Sylvie Le Hégarat-Mascle.

**Figure 2.** Example of object-level instance discrimination pipeline. Here, we represented the ReSim framework, …

**Figure 3.** Dense instance discrimination loss. proposed: a) Geometric alignment: VaDeR [32], PixContrast [13], PixPro [13], DUPR [33], InsCon [31], Leopart [16], LC-Loss [34] and CLOVE [35] assume that the geometric transforms between the positive images are known (thanks to the knowledge of the data-augmentation process), and use them to perform spatial alignment. Leopart [16] additionally relies on the attention m…
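
The geometric-alignment idea in the Figure 3 caption can be illustrated with a toy example. The sketch assumes the only augmentation is an axis-aligned crop (no flips or rotations) and pairs feature-map cells whose centers lie close together in the original image. The function names and the distance threshold are hypothetical, and the snippet does not reproduce any specific cited method.

```python
import numpy as np


def cell_centers(crop_box, grid):
    """Centers of a grid x grid feature map, in original-image coordinates.

    crop_box is (x0, y0, x1, y1), the region of the original image that the view covers.
    """
    x0, y0, x1, y1 = crop_box
    xs = x0 + (np.arange(grid) + 0.5) * (x1 - x0) / grid
    ys = y0 + (np.arange(grid) + 0.5) * (y1 - y0) / grid
    gx, gy = np.meshgrid(xs, ys)  # row index follows y, column index follows x
    return np.stack([gx.ravel(), gy.ravel()], axis=1)  # shape (grid * grid, 2)


def positive_pairs(box_a, box_b, grid, max_dist):
    """Boolean matrix: entry (i, j) is True if cell i of view A and cell j of
    view B are at most max_dist apart in the original image."""
    centers_a = cell_centers(box_a, grid)
    centers_b = cell_centers(box_b, grid)
    dist = np.linalg.norm(centers_a[:, None, :] - centers_b[None, :, :], axis=-1)
    return dist <= max_dist
```

With two identical crops, each cell matches only itself once `max_dist` is smaller than the cell spacing. Two crops that do not overlap produce no positive pairs, so such a view pair contributes nothing to a dense loss built on this matching.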

**Figure 4.** Common masking strategies for masked image modelling.
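
To make patch masking concrete, here is a minimal NumPy sketch of two strategies that are widely used in the MIM literature: uniform random masking and a single rectangular block. Whether these match the panels of Figure 4 is an assumption, since the figure itself is not reproduced here. The function names and parameters are illustrative.

```python
import numpy as np


def random_patch_mask(grid_h, grid_w, mask_ratio, rng):
    """Boolean (grid_h, grid_w) mask; True marks a patch hidden from the encoder."""
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    flat = np.zeros(num_patches, dtype=bool)
    # Hide a uniformly random subset of patches.
    flat[rng.permutation(num_patches)[:num_masked]] = True
    return flat.reshape(grid_h, grid_w)


def block_patch_mask(grid_h, grid_w, block_h, block_w, rng):
    """Boolean mask hiding one contiguous block_h x block_w rectangle of patches."""
    top = rng.integers(0, grid_h - block_h + 1)    # upper bound is exclusive
    left = rng.integers(0, grid_w - block_w + 1)
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    mask[top:top + block_h, left:left + block_w] = True
    return mask
```

A call such as `random_patch_mask(14, 14, 0.75, np.random.default_rng(0))` hides 147 of the 196 patches of a 14 x 14 grid. For small-object detection the choice matters, because a small object can fall entirely inside a single masked patch.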

Original abstract

Self-Supervised Learning (SSL) has emerged as a promising approach in computer vision, enabling networks to learn meaningful representations from large unlabeled datasets. SSL methods fall into two main categories: instance discrimination and Masked Image Modeling (MIM). While instance discrimination is fundamental to SSL, it was originally designed for classification and may be less effective for object detection, particularly for small objects. In this survey, we focus on SSL methods specifically tailored for real-world object detection, with an emphasis on detecting small objects in complex environments. Unlike previous surveys, we offer a detailed comparison of SSL strategies, including object-level instance discrimination and MIM methods, and assess their effectiveness for small object detection using both CNN and ViT-based architectures. Specifically, our benchmark is performed on the widely-used COCO dataset, as well as on a specialized real-world dataset focused on vehicle detection in infrared remote sensing imagery. We also assess the impact of pre-training on custom domain-specific datasets, highlighting how certain SSL strategies are better suited for handling uncurated data. Our findings highlight that instance discrimination methods perform well with CNN-based encoders, while MIM methods are better suited for ViT-based architectures and custom dataset pre-training. This survey provides a practical guide for selecting optimal SSL strategies, taking into account factors such as backbone architecture, object size, and custom pre-training requirements. Ultimately, we show that choosing an appropriate SSL pre-training strategy, along with a suitable encoder, significantly enhances performance in real-world object detection, particularly for small object detection in frugal settings.
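
For readers new to instance discrimination, one common objective is the contrastive InfoNCE loss, used for example in MoCo-style methods [3]. The NumPy sketch below is a generic image-level version under simple assumptions (in-batch negatives, cosine similarity, an arbitrary temperature of 0.2). It is not the object-level or dense variant the survey focuses on, and some instance discrimination methods use other objectives.

```python
import numpy as np


def info_nce(z1, z2, temperature=0.2):
    """InfoNCE loss between two (N, D) embedding batches.

    Row i of z1 and row i of z2 come from two augmented views of the same image
    (the positive pair); all other rows of z2 act as negatives for row i of z1.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                     # (N, N) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal: maximize their log-probability.
    return float(-np.mean(np.diag(log_prob)))
```

The loss is low when each embedding is closer to its own second view than to the other images in the batch. Because it compares whole-image embeddings, nothing in it forces the features to stay informative about small objects, which is the limitation the abstract points to.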

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Survey of SSL for object detection that adds benchmarks on COCO and infrared data, but the architecture-specific claims rest on comparisons whose fairness is unclear.

The paper surveys self-supervised pre-training methods for object detection, with a focus on small objects and real-world settings like remote sensing. It reviews instance discrimination and masked image modeling approaches, then reports new side-by-side results on COCO plus a custom infrared vehicle detection set, including tests of custom-domain pre-training. The main takeaway offered is that instance discrimination pairs better with CNN encoders while MIM works better with ViTs, and that custom pre-training helps in frugal regimes.

Referee Report

1 major / 0 minor

Summary. The paper surveys self-supervised learning (SSL) methods for real-world object detection with emphasis on small objects in complex environments. It contrasts instance discrimination and masked image modeling (MIM) approaches, benchmarks them on COCO and an infrared remote-sensing vehicle detection dataset using both CNN and ViT encoders, evaluates the effect of custom domain-specific pre-training, and concludes that instance discrimination performs well with CNN encoders while MIM is better suited to ViT architectures and custom pre-training, yielding gains especially for small-object detection in frugal settings.

Significance. If the reported benchmarks fairly isolate SSL strategy and encoder effects, the survey supplies a practical selection guide for SSL pre-training in object detection that accounts for backbone type, object scale, and domain-specific data. The inclusion of an infrared remote-sensing benchmark and explicit attention to small-object and frugal regimes adds applied relevance beyond generic classification-focused SSL surveys.

major comments (1)

Abstract and benchmark description: the central claim that instance discrimination suits CNN encoders while MIM suits ViTs (and custom pre-training) rests on COCO and infrared dataset comparisons. The provided text supplies no indication that training schedules, augmentations, optimizer settings, or object-size stratified splits were held fixed across SSL variants and backbones; without such controls the observed differences could arise from confounding factors rather than the claimed strategy–architecture interaction.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our survey. We address the single major comment below and commit to revisions that clarify the experimental protocol without altering the reported findings.

Referee: Abstract and benchmark description: the central claim that instance discrimination suits CNN encoders while MIM suits ViTs (and custom pre-training) rests on COCO and infrared dataset comparisons. The provided text supplies no indication that training schedules, augmentations, optimizer settings, or object-size stratified splits were held fixed across SSL variants and backbones; without such controls the observed differences could arise from confounding factors rather than the claimed strategy–architecture interaction.

Authors: We agree that the abstract and high-level benchmark description do not explicitly enumerate the controls. The full manuscript's experimental section standardizes training epochs, batch size, optimizer, and learning-rate schedule across all SSL variants for a given backbone, re-uses the same augmentation pipeline from the original SSL papers where feasible, and evaluates on the official COCO small/medium/large object-size splits. Nevertheless, to eliminate any ambiguity we will (i) expand the abstract with a sentence on controlled variables and (ii) insert a dedicated paragraph in the benchmark description that lists the fixed hyperparameters and confirms object-size stratification. These additions will make the isolation of SSL-strategy and encoder effects explicit.

Revision: yes

Circularity Check

0 steps flagged

No circularity: survey with external benchmarks

This is a literature survey plus new benchmark results on COCO and an infrared remote-sensing dataset. No equations, derivations, fitted parameters, or self-citation chains appear in the provided text. Claims rest on reported experimental outcomes and cited prior work rather than reducing to self-definition or fitted inputs by construction. The paper is self-contained against external benchmarks and therefore receives the default non-circularity outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

As a survey with added benchmarks, the work rests on standard domain assumptions of self-supervised learning and computer-vision evaluation rather than new postulates. No free parameters, invented entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)

Domain assumption: self-supervised pre-training on unlabeled data yields representations transferable to downstream object detection.
Invoked throughout the abstract as the premise enabling the surveyed methods.

pith-pipeline@v0.9.0 · 5819 in / 1110 out tokens · 22907 ms · 2026-05-23T19:00:10.553797+00:00 · methodology

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · 3 internal anchors

[1]

Microsoft coco: Common objects in context,

T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 740–755

work page 2014
[2]

Vehicle detection in aerial imagery: A small target detection benchmark,

S. Razakarivony and F. Jurie, “Vehicle detection in aerial imagery: A small target detection benchmark,” Journal of Visual Communication and Image Representation, vol. 34, pp. 187–203, 2016

work page 2016
[3]

Momentum contrast for unsupervised visual representation learning,

K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9729–9738

work page 2020
[4]

Bootstrap your own latent - a new approach to self-supervised learning,

J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar et al., “Bootstrap your own latent - a new approach to self-supervised learning,” Advances in neural information processing systems, vol. 33, pp. 21 271–21 284, 2020

work page 2020
[5]

Emerging properties in self-supervised vision transformers,

M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 9650–9660

work page 2021
[6]

A Survey on Contrastive Self-supervised Learning,

A. Jaiswal, A. R. Babu, M. Z. Zadeh, D. Banerjee, and F. Makedon, “A Survey on Contrastive Self-supervised Learning,” arXiv:2011.00362, Feb. 2021. [Online]. Available: http://arxiv.org/abs/2011.00362

work page arXiv 2011
[7]

Know your self-supervised learning: A survey on image-based generative and discriminative training,

U. Ozbulak, H. J. Lee, B. Boga, E. T. Anzaku, H. Park, A. Van Messem, W. De Neve, and J. Vankerschaver, “Know your self-supervised learning: A survey on image-based generative and discriminative training,” arXiv preprint arXiv:2305.13689, 2023

work page arXiv 2023
[8]

A survey on self-supervised learning: Algorithms, applications, and future trends,

J. Gui, T. Chen, J. Zhang, Q. Cao, Z. Sun, H. Luo, and D. Tao, “A survey on self-supervised learning: Algorithms, applications, and future trends,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

work page 2024
[9]

Masked autoencoders are scalable vision learners,

K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16 000–16 009

work page 2022
[10]

Semantic understanding of scenes through the ade20k dataset,

B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba, “Semantic understanding of scenes through the ade20k dataset,” International Journal of Computer Vision, vol. 127, pp. 302–321, 2019

work page 2019
[11]

Region similarity representation learning,

T. Xiao, C. J. Reed, X. Wang, K. Keutzer, and T. Darrell, “Region similarity representation learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 539–10 548

work page 2021
[12]

Dense contrastive learning for self-supervised visual pre-training,

X. Wang, R. Zhang, C. Shen, T. Kong, and L. Li, “Dense contrastive learning for self-supervised visual pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3024–3033

work page 2021
[13]

Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning,

Z. Xie, Y. Lin, Z. Zhang, Y. Cao, S. Lin, and H. Hu, “Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 16 684–16 693

work page 2021
[14]

Object discovery and representation networks,

O. J. Hénaff, S. Koppula, E. Shelhamer, D. Zoran, A. Jaegle, A. Zisserman, J. Carreira, and R. Arandjelović, “Object discovery and representation networks,” in European Conference on Computer Vision. Springer, 2022, pp. 123–143

work page 2022
[15]

Exploring set similarity for dense self-supervised representation learning,

Z. Wang, Q. Li, G. Zhang, P. Wan, W. Zheng, N. Wang, M. Gong, and T. Liu, “Exploring set similarity for dense self-supervised representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 590–16 599

work page 2022
[16]

Self-supervised learning of object parts for semantic segmentation,

A. Ziegler and Y. M. Asano, “Self-supervised learning of object parts for semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 14 502–14 511

work page 2022
[17]

Benchmarking detection transfer learning with vision transformers,

Y. Li, S. Xie, X. Chen, P. Dollar, K. He, and R. Girshick, “Benchmarking detection transfer learning with vision transformers,” arXiv preprint arXiv:2111.11429, 2021

work page arXiv 2021
[18]

What do self-supervised vision transformers learn?

N. Park, W. Kim, B. Heo, T. Kim, and S. Yun, “What do self-supervised vision transformers learn?” in The Eleventh International Conference on Learning Representations, 2022

work page 2022
[19]

Revealing the dark secrets of masked image modeling,

Z. Xie, Z. Geng, J. Hu, Z. Zhang, H. Hu, and Y. Cao, “Revealing the dark secrets of masked image modeling,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14 475–14 485

work page 2023
[20]

Observation, analysis, and solution: Exploring strong lightweight vision transformers via masked image modeling pre-training,

J. Gao, S. Lin, S. Wang, Y. Kou, Z. Li, L. Li, C. Zhang, X. Zhang, Y. Wang, and W. Hu, “Observation, analysis, and solution: Exploring strong lightweight vision transformers via masked image modeling pre-training,” arXiv preprint arXiv:2404.12210, 2024

work page arXiv 2024
[21]

A survey of self-supervised and few-shot object detection,

G. Huang, I. Laradji, D. Vazquez, S. Lacoste-Julien, and P. Rodriguez, “A survey of self-supervised and few-shot object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 4, pp. 4071–4089, 2022

work page 2022
[22]

Spatially consistent representation learning,

B. Roh, W. Shin, I. Kim, and S. Kim, “Spatially consistent representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1144–1153

work page 2021
[23]

Self-supervised visual representations learning by contrastive mask prediction,

Y. Zhao, G. Wang, C. Luo, W. Zeng, and Z.-J. Zha, “Self-supervised visual representations learning by contrastive mask prediction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 160–10 169

work page 2021
[24]

Aligning pretraining for detection via object-level contrastive learning,

F. Wei, Y. Gao, Z. Wu, H. Hu, and S. Lin, “Aligning pretraining for detection via object-level contrastive learning,” Advances in Neural Information Processing Systems, vol. 34, pp. 22 682–22 694, 2021

work page 2021
[25]

Casting your model: Learning to localize improves self-supervised representations,

R. R. Selvaraju, K. Desai, J. Johnson, and N. Naik, “Casting your model: Learning to localize improves self-supervised representations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11 058–11 067

work page 2021
[26]

Crafting better contrastive views for siamese representation learning,

X. Peng, K. Wang, Z. Zhu, M. Wang, and Y. You, “Crafting better contrastive views for siamese representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 031–16 040

work page 2022
[27]

Instance localization for self-supervised detection pretraining,

C. Yang, Z. Wu, B. Zhou, and S. Lin, “Instance localization for self-supervised detection pretraining,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3987–3996

work page 2021
[28]

CP2: Copy-paste contrastive pretraining for semantic segmentation,

F. Wang, H. Wang, C. Wei, A. Yuille, and W. Shen, “CP2: Copy-paste contrastive pretraining for semantic segmentation,” in European Conference on Computer Vision. Springer, 2022, pp. 499–515

work page 2022
[29]

Unsupervised object-level representation learning from scene images,

J. Xie, X. Zhan, Z. Liu, Y. S. Ong, and C. C. Loy, “Unsupervised object-level representation learning from scene images,” Advances in Neural Information Processing Systems, vol. 34, pp. 28 864–28 876, 2021

work page 2021
[30]

Unsupervised learning of visual features by contrasting cluster assignments,

M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “Unsupervised learning of visual features by contrasting cluster assignments,” Advances in neural information processing systems, vol. 33, pp. 9912–9924, 2020

work page 2020
[31]

Inscon: Instance consistency feature representation via self-supervised learning,

J. Yang, K. Zhang, Z. Cui, J. Su, J. Luo, and X. Wei, “Inscon: Instance consistency feature representation via self-supervised learning,” arXiv preprint arXiv:2203.07688, 2022

work page arXiv 2022
[32]

Unsupervised learning of dense visual representations,

P. O. O Pinheiro, A. Almahairi, R. Benmalek, F. Golemo, and A. C. Courville, “Unsupervised learning of dense visual representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 4489–4500, 2020

work page 2020
[33]

Deeply unsupervised patch re-identification for pre-training object detectors,

J. Ding, E. Xie, H. Xu, C. Jiang, Z. Li, P. Luo, and G.-S. Xia, “Deeply unsupervised patch re-identification for pre-training object detectors,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022

work page 2022
[34]

Self-supervised learning with local contrastive loss for detection and semantic segmentation,

A. Islam, B. Lundell, H. Sawhney, S. N. Sinha, P. Morales, and R. J. Radke, “Self-supervised learning with local contrastive loss for detection and semantic segmentation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 5624–5633

work page 2023
[35]

Self-supervised learning of contextualized local visual embeddings,

T. Silva, H. Pedrini, and A. Ramírez, “Self-supervised learning of contextualized local visual embeddings,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 177–186

work page 2023
[36]

Self-emd: Self-supervised object detection without imagenet,

S. Liu, Z. Li, and J. Sun, “Self-emd: Self-supervised object detection without imagenet,” arXiv preprint arXiv:2011.13677, 2020

work page arXiv 2011
[37]

Vicregl: Self-supervised learning of local visual features,

A. Bardes, J. Ponce, and Y. LeCun, “Vicregl: Self-supervised learning of local visual features,” Advances in Neural Information Processing Systems, vol. 35, pp. 8799–8810, 2022

work page 2022
[38]

Efficient visual pretraining with contrastive detection,

O. J. Hénaff, S. Koppula, J.-B. Alayrac, A. Van den Oord, O. Vinyals, and J. Carreira, “Efficient visual pretraining with contrastive detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 086–10 096

work page 2021
[39]

Efficient graph-based image segmentation,

P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient graph-based image segmentation,” International journal of computer vision, vol. 59, pp. 167–181, 2004

work page 2004
[40]

Beit: Bert pre-training of image transformers,

H. Bao, L. Dong, S. Piao, and F. Wei, “Beit: Bert pre-training of image transformers,” in International Conference on Learning Representations, 2021

work page 2021
[41]

Image bert pre-training with online tokenizer,

J. Zhou, C. Wei, H. Wang, W. Shen, C. Xie, A. Yuille, and T. Kong, “Image bert pre-training with online tokenizer,” in International Conference on Learning Representations, 2021

work page 2021
[42]

Self-supervised learning from images with a joint-embedding predictive architecture,

M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y. LeCun, and N. Ballas, “Self-supervised learning from images with a joint-embedding predictive architecture,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15 619–15 629

work page 2023
[43]

Simmim: A simple framework for masked image modeling,

Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu, “Simmim: A simple framework for masked image modeling,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9653–9663

work page 2022
[44]

Context encoders: Feature learning by inpainting,

D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2536–2544

work page 2016
[45]

Imagenet: A large-scale hierarchical image database,

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009, pp. 248–255

work page 2009
[46]

You only train once: Learning a general anomaly enhancement network with random masks for hyperspectral anomaly detection,

Z. Li, Y. Wang, C. Xiao, Q. Ling, Z. Lin, and W. An, “You only train once: Learning a general anomaly enhancement network with random masks for hyperspectral anomaly detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–18, 2023

work page 2023
[47]

The inaturalist species classification and detection dataset,

G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie, “The inaturalist species classification and detection dataset,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8769–8778

work page 2018
[48]

Corrupted image modeling for self-supervised visual pre-training,

Y. Fang, L. Dong, H. Bao, X. Wang, and F. Wei, “Corrupted image modeling for self-supervised visual pre-training,” in The Eleventh International Conference on Learning Representations, 2022

work page 2022
[49]

Pixmim: Rethinking pixel reconstruction in masked image modeling,

Y. Liu, S. Zhang, J. Chen, K. Chen, and D. Lin, “Pixmim: Rethinking pixel reconstruction in masked image modeling,” Transactions on Machine Learning Research, 2024

work page 2024
[50]

Mst: Masked self-supervised transformer for visual representation,

Z. Li, Z. Chen, F. Yang, W. Li, Y. Zhu, C. Zhao, R. Deng, L. Wu, R. Zhao, M. Tang et al., “Mst: Masked self-supervised transformer for visual representation,” Advances in Neural Information Processing Systems, vol. 34, pp. 13 165–13 176, 2021

work page 2021
[51]

What to hide from your students: Attention-guided masked image modeling,

I. Kakogeorgiou, S. Gidaris, B. Psomas, Y. Avrithis, A. Bursuc, K. Karantzalos, and N. Komodakis, “What to hide from your students: Attention-guided masked image modeling,” in European Conference on Computer Vision. Springer, 2022, pp. 300–318

work page 2022
[52]

Good helper is around you: Attention-driven masked image modeling,

Z. Liu, J. Gui, and H. Luo, “Good helper is around you: Attention-driven masked image modeling,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, 2023, pp. 1799–1807

work page 2023
[53]

Milan: Masked image pretraining on language assisted representation,

Z. Hou, F. Sun, Y.-K. Chen, Y. Xie, and S.-Y. Kung, “Milan: Masked image pretraining on language assisted representation,” arXiv preprint arXiv:2208.06049, 2022

work page arXiv 2022
[54]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763

work page 2021
[55]

Semmae: Semantic-guided masking for learning masked autoencoders,

G. Li, H. Zheng, D. Liu, C. Wang, B. Su, and C. Zheng, “Semmae: Semantic-guided masking for learning masked autoencoders,” Advances in Neural Information Processing Systems, vol. 35, pp. 14 290–14 302, 2022

work page 2022
[56]

Dppmask: Masked image modeling with determinantal point processes,

J. Xu, Z. Lin, D. Zhou, Y. Yang, X. Liao, Q. Wang, B. Wu, G. Chen, and P.-A. Heng, “Dppmask: Masked image modeling with determinantal point processes,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 2266–2276

work page 2024
[57]

Extracting and composing robust features with denoising autoencoders,

P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th international conference on Machine learning, 2008, pp. 1096–1103

work page 2008
[58]

The devil is in the frequency: Geminated gestalt autoencoder for self-supervised visual pre-training,

H. Liu, X. Jiang, X. Li, A. Guo, Y. Hu, D. Jiang, and B. Ren, “The devil is in the frequency: Geminated gestalt autoencoder for self-supervised visual pre-training,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, 2023, pp. 1649–1656

work page 2023
[59]

Architecture-agnostic masked image modeling from vit back to cnn,

S. Li, D. Wu, F. Wu, Z. Zang, and S. Z. Li, “Architecture-agnostic masked image modeling from vit back to cnn,” in International Conference on Machine Learning. PMLR, 2023, pp. 20 149–20 167

work page 2023
[60]

Masked feature prediction for self-supervised visual pre-training,

C. Wei, H. Fan, S. Xie, C.-Y. Wu, A. Yuille, and C. Feichtenhofer, “Masked feature prediction for self-supervised visual pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 14 668–14 678

work page 2022
[61]

Self-supervised masking for unsupervised anomaly detection and localization,

C. Huang, Q. Xu, Y. Wang, Y. Wang, and Y. Zhang, “Self-supervised masking for unsupervised anomaly detection and localization,” IEEE Transactions on Multimedia, 2022

work page 2022
[62]

A unified view of masked image modeling,

Z. Peng, L. Dong, H. Bao, F. Wei, and Q. Ye, “A unified view of masked image modeling,” Transactions on Machine Learning Research, 2022

work page 2022
[63]

Stare at what you see: Masked image modeling without reconstruction,

H. Xue, P. Gao, H. Li, Y. Qiao, H. Sun, H. Li, and J. Luo, “Stare at what you see: Masked image modeling without reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22 732–22 741

work page 2023
[64]

Are large-scale datasets necessary for self-supervised pre-training?

A. El-Nouby, G. Izacard, H. Touvron, I. Laptev, H. Jegou, and E. Grave, “Are large-scale datasets necessary for self-supervised pre-training?” arXiv preprint arXiv:2112.10740, 2021

[65]

X. Liu, J. Zhou, T. Kong, X. Lin, and R. Ji, “Exploring target representations for masked autoencoders,” arXiv preprint arXiv:2209.03917, 2022.

[66]

K. Tian, Y. Jiang, C. Lin, L. Wang, Z. Yuan et al., “Designing BERT for convolutional networks: Sparse and hierarchical masked modeling,” in The Eleventh International Conference on Learning Representations, 2022.

[67]

P. Gao, T. Ma, H. Li, Z. Lin, J. Dai, and Y. Qiao, “ConvMAE: Masked convolution meets masked autoencoders,” arXiv preprint arXiv:2205.03892, 2022.

[68]

J. Liu, X. Huang, J. Zheng, Y. Liu, and H. Li, “MixMAE: Mixed and masked autoencoder for efficient pretraining of hierarchical vision transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6252–6261.

[69]

Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A ConvNet for the 2020s,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11976–11986.

[70]

K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” arXiv:1703.06870, Jan. 2018. [Online]. Available: http://arxiv.org/abs/1703.06870

[71]

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.

[72]

P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch SGD: Training ImageNet in 1 hour,” arXiv preprint arXiv:1706.02677, 2017.

[73]

Y. Fang, S. Yang, S. Wang, Y. Ge, Y. Shan, and X. Wang, “Unleashing vanilla vision transformer with masked image modeling for object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 6244–6253.

[74]

Q. Liu, X. Li, Z. He, C. Li, J. Li, Z. Zhou, D. Yuan, J. Li, K. Yang, N. Fan et al., “LSOTB-TIR: A large-scale high-diversity thermal infrared object tracking benchmark,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 3847–3856.

[75]

H. Sun, J. Bai, F. Yang, and X. Bai, “Receptive-field and direction induced attention network for infrared dim small target detection with a large-scale dataset IRDST,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–13, 2023.

[76]

T. Imaging, “Flir data set dataset,” Mar. 2024, visited on 2024-07-16. [Online]. Available: https://universe.roboflow.com/thermal-imaging-0hwfw/flir-data-set

[77]

H. Wang, L. Zhou, and L. Wang, “Miss detection vs. false alarm: Adversarial learning for small object segmentation in infrared images,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8509–8518.

[78]

J. Portmann, S. Lynen, M. Chli, and R. Siegwart, “People detection and tracking from aerial thermal views,” in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 1794–1800.

[79]

J. Suo, T. Wang, X. Zhang, H. Chen, W. Zhou, and W. Shi, “HIT-UAV: A high-altitude infrared thermal dataset for unmanned aerial vehicle-based object detection,” Scientific Data, vol. 10, no. 1, p. 227, 2023.

[80]

M. Zhang, R. Zhang, Y. Yang, H. Bai, J. Zhang, and J. Guo, “ISNet: Shape matters for infrared small target detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 877–886.

